
On the complexity of scheduling checkpoints for computational workflows

Author: Robert Yves
Vivien Frédéric
Zaidouni Dounia
Publication venue: HAL CCSD
Publication date: 01/01/2012

This paper deals with the complexity of scheduling computational workflows in the presence of Exponential failures. When such a failure occurs, rollback and recovery are used so that the execution can resume from the last checkpointed state. The goal is to minimize the expected execution time, and we have to decide in which order to execute the tasks, and whether to checkpoint or not after the completion of each given task. We show that this scheduling problem is strongly NP-complete, and propose a (polynomial-time) dynamic programming algorithm for the case where the application graph is a linear chain. These results lay the theoretical foundations of the problem, and constitute a prerequisite before discussing scheduling strategies for arbitrary DAGs of moldable tasks subject to general failure distributions.
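
The linear-chain case mentioned in the abstract above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the authors' algorithm as published: it assumes the expected time to run `work` units followed by a checkpoint under failure rate `lam` is `exp(lam * recovery) * (1 / lam + downtime) * (exp(lam * (work + ckpt)) - 1)`, that a checkpoint is always taken after the last task, and that the first segment restarts from scratch with no recovery cost. The names `expected_segment_time` and `schedule_chain` are hypothetical.

```python
import math


def expected_segment_time(work, ckpt, downtime, recovery, lam):
    """Expected time to complete `work` plus a checkpoint of cost `ckpt`
    under Exponential failures of rate `lam` (assumed closed form)."""
    return (math.exp(lam * recovery) * (1.0 / lam + downtime)
            * math.expm1(lam * (work + ckpt)))


def schedule_chain(weights, ckpt_costs, recovery_costs, downtime, lam):
    """Choose after which tasks of a linear chain to checkpoint.

    best[i] is the minimal expected time to finish tasks 1..i with a
    checkpoint taken right after task i. Returns (expected_time, positions),
    where positions are 1-based task indices followed by a checkpoint.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            # Segment j+1..i restarts from the checkpoint taken after task j.
            # Assumption: no recovery cost when restarting from the beginning.
            rec = 0.0 if j == 0 else recovery_costs[j - 1]
            cand = best[j] + expected_segment_time(
                prefix[i] - prefix[j], ckpt_costs[i - 1], downtime, rec, lam)
            if cand < best[i]:
                best[i] = cand
                prev[i] = j

    positions = []
    i = n
    while i > 0:
        positions.append(i)
        i = prev[i]
    positions.reverse()
    return best[n], positions
```

Because Exponential failures are memoryless, the expected times of consecutive segments simply add up, which is what makes this quadratic-time recurrence valid in the sketch. With a very small failure rate the sketch checkpoints only at the end; with frequent failures and cheap checkpoints it checkpoints after every task.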

On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing

Author: Casanova Henri
Robert Yves
Vivien Frédéric
Zaidouni Dounia
Publication venue: Elsevier BV
Publication date: 01/10/2015

Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has been recently advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replication in simulation using both synthetic and real-world failure traces so as to quantify average application makespan. One interesting result from these experiments is that, when process replication is used, application performance is not sensitive to the checkpointing period, provided that the period is within a large neighborhood of the optimal period. More generally, our empirical results make it possible to identify regimes in which process replication is beneficial.

Cost-Optimal Execution of Trees of Boolean Operators with Shared Streams

Author: Casanova Henri
Lim Lipyeow
Robert Yves
Vivien Frédéric
Zaidouni Dounia
Publication venue: HAL CCSD
Publication date: 01/01/2013

The processing of queries expressed as trees of boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors to a query processing device, such as a smartphone, over one or more network interfaces. Retrieving a data item incurs a cost, e.g., an energy expense that depletes the smartphone's battery. Since the query tree contains boolean operators, part of the tree can be short-circuited depending on the retrieved sensor data. An interesting problem is to determine the order in which predicates should be evaluated so as to minimize the expected query processing cost. This problem has been studied in previous work assuming that each data stream occurs in a single predicate. In this work we remove this assumption since it does not necessarily hold for real-world queries. Our main results are an optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including the one heuristic proposed in previous work for our general version of the query processing problem.

Cost-Optimal Execution of Boolean Query Trees with Shared Streams

Author: Casanova Henri
Lim Lipyeow
Robert Yves
Vivien Frédéric
Zaidouni Dounia
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/05/2014

The processing of queries expressed as trees of boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors, which incurs a cost, e.g., an energy expense that depletes the battery of a mobile query processing device. The objective is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost. This problem has been studied assuming that each data stream occurs at a single predicate. In this work we remove this assumption since it does not necessarily hold for real-world queries. Our main results are an optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including a heuristic proposed in previous work.
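
To make the cost model in the abstract above concrete, the following sketch computes the expected cost of evaluating a single-level AND query in a given order when predicates may share streams. It is an illustration under simplifying assumptions, not the optimal algorithm from the paper: each predicate reads exactly one stream, a stream is paid for only the first time it is retrieved, predicate outcomes are independent, and the best order is found by brute force. The function names, stream names, costs, and probabilities are all hypothetical.

```python
from itertools import permutations


def expected_and_cost(order, predicates, stream_cost):
    """Expected retrieval cost of evaluating an AND of predicates in `order`.

    predicates[k] = (stream_name, probability_true). Evaluation stops at the
    first predicate that is false. A stream already retrieved costs nothing.
    """
    cost = 0.0
    reach_prob = 1.0      # probability that evaluation reaches this predicate
    retrieved = set()     # streams already paid for if this point is reached
    for k in order:
        stream, p_true = predicates[k]
        if stream not in retrieved:
            cost += reach_prob * stream_cost[stream]
            retrieved.add(stream)
        reach_prob *= p_true
    return cost


def best_order_bruteforce(predicates, stream_cost):
    """Exhaustive search over all orders; only practical for tiny queries."""
    return min(permutations(range(len(predicates))),
               key=lambda o: expected_and_cost(o, predicates, stream_cost))


# Hypothetical example: two predicates share the expensive "gps" stream.
stream_cost = {"gps": 10.0, "accel": 2.0}
predicates = [("gps", 0.5), ("accel", 0.9), ("gps", 0.8)]
best = best_order_bruteforce(predicates, stream_cost)
best_cost = expected_and_cost(best, predicates, stream_cost)
```

In this example, evaluating both "gps" predicates before the "accel" predicate is cheaper than any order that interleaves them, because the second "gps" predicate is free once the stream has been retrieved and it lowers the probability of ever paying for "accel".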



Impact of fault prediction on checkpointing strategies

Author: Dounia Zaidouni
Frédéric Vivien
Guillaume Aupy
Yves Robert
Publication venue
Publication date: 01/01/2012

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation are corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results.

Comment: 20 pages
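
For context on the classical analysis that the abstract above extends, here is a minimal sketch of Young's first-order approximation of the checkpointing period, without any fault predictor. It assumes the usual first-order waste model `C / T + T / (2 * mtbf)`, which ignores downtime and recovery; the paper's extensions for recall, precision, and prediction windows are not reproduced here. Function and parameter names are hypothetical.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpointing period: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Fraction of time wasted, to first order.

    checkpoint_cost / period is the checkpoint overhead; period / (2 * mtbf)
    is the expected work lost to failures (half a period per failure).
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical values: 10-minute checkpoints, one failure per day on average.
C, MTBF = 600.0, 86400.0
T = young_period(C, MTBF)
waste_at_T = first_order_waste(T, C, MTBF)
```

Under this model the waste at the optimal period equals `sqrt(2 * C / mtbf)`, and moving the period away from `T` in either direction increases the waste.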

Comments on "Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing"

Author: Aupy Guillaume
Robert Yves
Vivien Frédéric
Zaidouni Dounia
Publication venue: HAL CCSD
Publication date: 21/06/2013

In this short note, we provide some comments on the recent paper "Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing" by Bouguerra et al. We start by identifying some errors in their equations. Then we explain that they do not actually use the distribution of lead times, contrary to statements by the authors. Finally, we show that their algorithm does not change policy at the best possible moment, and we point to our own earlier research report for the (correct version of the) optimal algorithm.

Using group replication for resilience on exascale systems

Author: Dounia Zaidouni
Frédéric Vivien
Henri Casanova
Marin Bougeret
Yves Robert
Publication venue
Publication date: 09/12/2011

High performance computing applications must be resilient to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.